This collection contains datasets derived from the "early web" (1996-1999) portion of the Internet Archive's global web archive collection.
This dataset consists of multi-language URLs from the early web (1996-1999), grouped by common URL patterns. “Parallel language” refers to the same text represented in different languages; multi-language text from websites is a rich source for parallel language corpora and can be valuable in work such as machine translation. The URLs and patterns were extracted from the Internet Archive’s Wayback Machine web archive, covering all archived webpages successfully captured before the year 2000. In total, this dataset contains 1,164,183 such parallel, multilingual records. Additional details on format, statistics, and samples can be found in the dataset’s README file.
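As a rough illustration of what grouping URLs by a common pattern can look like, the sketch below replaces standalone language codes in a URL with a placeholder and collects URLs that share the resulting pattern. The language-code list, the `{lang}` placeholder, and the function names `url_pattern` and `group_parallel` are assumptions made for this example; the rules actually used to build the dataset are described in its README and may differ.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately small set of language codes (an assumption for
# this sketch, not the list used to build the dataset).
LANG_CODES = {"en", "de", "fr", "es", "it"}

# Match a language code only when it is not part of a longer word,
# e.g. "/en/" or "_fr." but not the "de" inside "index".
_LANG_TOKEN = re.compile(
    r"(?<![A-Za-z])(" + "|".join(sorted(LANG_CODES)) + r")(?![A-Za-z])"
)


def url_pattern(url):
    """Return the URL with standalone language codes replaced by '{lang}'."""
    return _LANG_TOKEN.sub("{lang}", url.lower())


def group_parallel(urls):
    """Group URLs by pattern, keeping only patterns shared by several URLs."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return {pattern: members for pattern, members in groups.items()
            if len(members) > 1}


if __name__ == "__main__":
    sample = [
        "http://example.com/en/index.html",
        "http://example.com/de/index.html",
        "http://example.com/about.html",
    ]
    for pattern, members in group_parallel(sample).items():
        print(pattern, members)
```

Under these assumptions, the two `index.html` URLs fall into one group, while the URL without a language code is left out.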
This dataset consists of metadata records identifying the language of webpages from the early web (1996-1999) portion of the Internet Archive’s web archive collection. Each record includes capture information for a URL along with the top three language annotations for the page, as detected by Google’s Compact Language Detector (CLD3), a neural network model for language identification. Confidence scores are computed according to the Internet Archive’s own Square-Leaf-Model, which combines the scores CLD3 assigns to the text of each HTML DOM-tree leaf node with the length of the analyzed text snippets. In total, this dataset contains 4,383,611 language-annotated records. Additional details on format, statistics, and samples can be found in the dataset's README file.
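The exact Square-Leaf-Model formula is not given in this description. To illustrate the general idea of combining per-leaf detector scores with text length, the sketch below uses a plain length-weighted average and keeps the top three languages. The weighting scheme, the function name `aggregate_leaf_scores`, and the `(language, score, text_length)` input shape are all assumptions for illustration and should not be read as the Internet Archive's actual method.

```python
from collections import defaultdict


def aggregate_leaf_scores(leaves, top_n=3):
    """Combine per-leaf language scores into page-level scores.

    `leaves` is an iterable of (language, score, text_length) tuples, one per
    leaf text node. Longer snippets contribute more to the result.
    NOTE: a simple length-weighted average is an assumption of this sketch,
    not the documented Square-Leaf-Model.
    """
    weighted = defaultdict(float)
    total_length = 0
    for language, score, text_length in leaves:
        weighted[language] += score * text_length
        total_length += text_length
    if total_length == 0:
        return []
    ranked = sorted(
        ((language, value / total_length) for language, value in weighted.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:top_n]


if __name__ == "__main__":
    page_leaves = [("en", 0.9, 100), ("de", 0.8, 50), ("en", 0.5, 50)]
    print(aggregate_leaf_scores(page_leaves))
```

The point of weighting by length in a sketch like this is that a detector's score on a long paragraph is usually more trustworthy than its score on a two-word menu item.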